(54) Unicode conversion into multiple encodings 

(57) Techniques for converting source text (e.g., Unicode
text) to multiple different encodings are disclosed.
The disclosed techniques operate without any font or 
style information that could suggest the original encod- 
ing types. For a given source text, the techniques intelli- 
gently determine which of a variety of available target 
encodings are most appropriate. The determination of 
the most appropriate target encodings is flexible 
enough to accommodate different criteria or tolerance 
levels in performing the conversion as may be desired. 
The conversion out of Unicode into multiple different 
encodings also requires the determination of where and 
when to switch between the available target encodings. 
Also disclosed is a technique to automatically identify
those target encodings that are available.
Description 

BACKGROUND OF THE INVENTION 

1. Field of the Invention

[0001] The present invention relates to a system for 
converting between character codes for printed or dis- 
played text and, more particularly, to a code converter 
for converting one character set to multiple character 
sets. 

2. Description of the Related Art 

[0002] Computers and other electronic devices typi- 
cally use text to interact with users. The text is usually 
displayed on a monitor or some other type of display 
device. Because the text must be represented in digital 
form inside the computer or other electronic device, a 
character set encoding must be used. Generally speak- 
ing, a character set encoding operates to encode each 
character of the character set with a unique digital rep- 
resentation. The characters (which are encoded) corre- 
spond to letters, numbers and various text symbols and 
are assigned numeric codes for use by computers or 
other electronic devices. The most popular character
set for use with computers and other electronic devices
is the American Standard Code for Information
Interchange (ASCII). ASCII uses 7-bit sequences for its
encodings. In other countries, different character sets
are used. In Europe, the dominant character encoding
standards are the ISO 8859-X family, especially ISO
8859-1 (called "Latin-1") developed by the International
Standards Organization (ISO). In Japan, the dominant
character encoding standard is JIS X0208, where JIS
refers to the Japanese Industrial Standard, which was
developed by the Japanese Standards Association (JSA).
Examples of other existing character sets include Mac™
OS Standard Roman encoding (by Apple Computer,
Inc.), Shift-JIS (Japan), Big5 (Taiwan), and many more.
[0003] With the ongoing globalization of business and 
networks, it has become important for computers or 
other electronic devices to be able to handle multiple 
character encodings. For example, the same computer 
or electronic device may be used by persons of different 
nationalities who wish to interact with the computer or 
other electronic device in a different language. For each 
such language a different character set encoding is usu- 
ally needed. However, character sets for the same lan- 
guage can also differ. 

[0004] There is also a need to be able to convert from 
one character set encoding to another encoding. For 
example, a user in France using ISO 8859-1 may want 
to send an electronic mail message in French to a user 
in Israel who is using ISO 8859-8. Because the sender 
and receiver are using different character set encod- 
ings, the non-ASCII characters in the message will be 
garbled for the user in Israel. Ideally, one of the computers
or electronic devices would convert from one character
set to another character set. This has been
achieved to a limited extent between a few character
sets, but is largely not possible with modern computers
or electronic devices. Code conversion is made difficult
because of the numerous different character standards 
and the often conflicting or inconsistent national stand- 
ards. 

[0005] The Unicode™ standard (hereafter simply Unicode
or Unicode standard) was developed to provide an
international character encoding standard. The designers
of the Unicode standard wanted to provide, and did
provide, a more efficient and flexible method of character
identification. The Unicode standard includes characters of all
major International Standards approved and published
before December 31, 1990, as well as other characters
not in previous standards. The characters are encoded
in the Unicode standard without duplication. The codes
within the Unicode standard are 16 bits (or 2 bytes)
wide.

[0006] A character code standard such as the Uni- 
code standard facilitates code conversion and enables 
the implementation of useful processes operating on 
textual data. For example, in accordance with the above
example, the computer or other electronic device in
France can transmit Unicode characters and the computer
or other electronic device in Israel can convert the
Unicode characters it receives into a Hebrew based
character set that is compatible with the computer or
other electronic device in Israel.

[0007] For additional detail about the Unicode standard,
see, e.g., The Unicode Standard, Worldwide Character
Encoding, Version 2.0, Addison-Wesley 1996,
which is hereby incorporated by reference in its entirety.

[0008] One problem with Unicode is that when the
Unicode text originates from multiple different encodings,
it is difficult to convert the Unicode text back to the
original multiple different encodings. In particular, some
computer systems or applications that execute on computer
systems do not support Unicode encodings.
Hence, when such computer systems or applications
receive Unicode text they are not able to properly utilize
the text. Accordingly, code conversion of the Unicode text to
a target encoding understood by the computer system
or application is needed. The difficulty is that when the Unicode
text originates from multiple different encodings, the
computer system (e.g., operating system) would not
normally understand how to convert the Unicode back
to the original multiple different encodings. In some
cases, font or style information might be available and
associated with the Unicode text so as to provide a suggestion
as to the originating encodings. However, often
such font or style information is not available.
[0009] Thus, there is a need for improved approaches
to converting Unicode text to multiple different encodings.
SUMMARY OF THE INVENTION 

[0010] Broadly speaking, the invention relates to techniques
for converting source text (e.g., Unicode text) to
multiple different encodings. The invention operates
without any font or style information that would suggest 
the original encoding types. The invention is able to 
intelligently determine which of a variety of available tar- 
get encodings are most appropriate for the given source 
text. The determination of the most appropriate target 
encodings can be flexible enough to accommodate dif- 
ferent criteria or tolerance levels in performing its con- 
version. The criteria can, for example, be determined 
according to the intended use for the converted text, 
namely printing or displaying of the converted text. The
various tolerance levels can, for example, include strict,
loose or fallback. Another aspect of the invention pertains
to the automatic identification of those target
encodings that are available.

[0011] The invention can be implemented in numer- 
ous ways, including as a system, an apparatus, a 
method, or computer readable medium. Several 
embodiments of the invention are summarized below. 
[0012] As a code conversion system for converting a 
source string to a target string, an embodiment of the 
invention includes: a target encoding list containing 
available target encodings for the code conversion sys- 
tem; and a multi-encoding code converter that receives 
the source string and converts the source string into the 
target string, the target string including a plurality of 
encoding runs of different ones of the available target 
encodings. 

[0013] As a computer-implemented method for con- 
verting a source encoding to target encodings selected 
from available target encodings, an embodiment of the 
invention includes the acts of: receiving a source text 
block, the source text block including a series of text ele- 
ments; selecting one of the available target encodings; 
selecting one of the text elements from the source text 
block; determining whether the selected text element 
can be converted into the selected target encoding; 
selecting a next one of the text elements from the 
source text block and repeating the determining when 
the selected text element can be converted into the 
selected target encoding; and selecting another one of 
the available target encodings and repeating the deter- 
mining for the selected text element when the selected 
text element cannot be converted into the selected tar- 
get encoding. 
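
The acts recited above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the claimed implementation: each Python character is treated as one text element, Python's built-in codecs stand in for the available target encodings, and the names `can_convert` and `convert_to_runs` are hypothetical.

```python
def can_convert(element: str, encoding: str) -> bool:
    """Return True when the text element is representable in the encoding."""
    try:
        element.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def convert_to_runs(source: str, encodings: list[str]) -> list[tuple[str, bytes]]:
    """Convert source into (encoding, bytes) runs, switching encodings only when needed."""
    runs: list[tuple[str, bytes]] = []
    current = None            # currently selected target encoding
    buffer = bytearray()      # converted bytes of the run in progress
    for element in source:
        if current is None or not can_convert(element, current):
            # Select another available target encoding (list order = preference).
            chosen = next((e for e in encodings if can_convert(element, e)), None)
            if chosen is None:
                raise ValueError(f"no available encoding for {element!r}")
            if buffer:
                runs.append((current, bytes(buffer)))
                buffer = bytearray()
            current = chosen
        buffer += element.encode(current)
    if buffer:
        runs.append((current, bytes(buffer)))
    return runs


if __name__ == "__main__":
    for encoding, data in convert_to_runs("abc\u05e9\u05dc\u05d5\u05dd!",
                                          ["latin-1", "iso8859_8"]):
        print(encoding, data.hex())
```

In this sketch the converter stays in the current encoding for as long as each next element converts, which is one of several possible switching policies.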

[0014] As a computer-implemented method for pro- 
ducing a target encoding list for use by a code conver- 
sion system in converting characters in a source 
encoding to at least one target encoding, the target 
encoding list and the code conversion system being 
associated with a computer system, an embodiment of 
the invention includes the acts of: retrieving scripts that 
are installed on the computer system; retrieving fonts 
that are installed on the computer system; determining
a portion of the retrieved fonts to be added to the target
encoding list; and producing the target encoding list
from the retrieved scripts and the portion of the retrieved 
fonts. 

[0015] As a computer readable medium including 
computer program code for converting a source encod- 
ing to target encodings selected from available target 
encodings, an embodiment of the invention includes: 
first computer program code configured to receive a
source text block, the source text block including a
series of characters; second computer program code configured
to select one of the available target encodings;
third computer program code configured to select one of
the characters from the source text block; fourth computer
program code configured to determine whether the
selected character can be converted into the selected
target encoding; and fifth computer program code configured
to select another one of the available target
encodings and then to repeat the fourth computer program
code using the newly selected target encoding
when the fourth computer program code determines
that the selected character cannot be converted into
the selected target encoding.

[0016] As a computer readable medium including 
computer program code for producing a target encoding 
list for use by a code conversion system in converting 
characters in a source encoding to at least one target 
encoding, the target encoding list and the code conver- 
sion system being associated with a computer system, 
an embodiment of the invention includes: computer pro- 
gram code configured to retrieve scripts that are 
installed on the computer system; computer program 
code configured to retrieve fonts that are installed on the 
computer system; computer program code configured 
to determine a portion of the retrieved fonts to be added 
to the target encoding list; and computer program code 
configured to produce the target encoding list from the 
retrieved scripts and the portion of the retrieved fonts. 
[0017] The invention has various advantages depending
on the aspects of the invention being implemented.
One advantage of the invention is that it converts Uni- 
code text to multiple target encodings so that a high 
quality code conversion is achieved. Another advantage 
of the invention is that it is not dependent on having any 
font or style information that would assist in the code 
conversion. Still another advantage of the invention is 
that it is efficient and controllable such that various cri- 
teria and tolerances can affect the code conversion. 
Another advantage of the invention is that the invention 
can also choose appropriate target encodings for fall- 
back mappings. Yet another advantage of the invention 
is the ability to identify available target encodings on a 
computer system in an automated fashion. 
[0018] Other aspects and advantages of the invention
will become apparent from the following detailed
description, taken in conjunction with the accompanying 
drawings, illustrating by way of example the principles of 
the invention. 
BRIEF DESCRIPTION OF THE DRAWINGS 

[0019] The present invention will be readily understood
by the following detailed description in conjunction
with the accompanying drawings, wherein like
reference numerals designate like structural elements,
and in which:

FIG. 1 is a code conversion system according to an
embodiment of the invention;

FIG. 2 illustrates a block diagram of an embodiment
of a Unicode code conversion system according to
the invention;

FIG. 3A illustrates a diagram of a conversion availability
table according to an embodiment of the invention;

FIG. 3B is a block diagram of a representative conversion
availability table for a particular target
encoding;

FIG. 4 is a flow diagram of target encoding list creation
processing according to an embodiment of
the invention;

FIG. 5 is a flow diagram of code conversion
processing according to an embodiment of the
invention;

FIG. 6 is a flow diagram of longest target encoding
run processing according to an embodiment of the
invention;

FIG. 7 is a flow diagram of multiple target coding
selection processing according to an embodiment
of the invention;

FIG. 8 is a flow diagram of update text run array
processing according to an embodiment of the
invention;

FIG. 9 is a flow diagram of fallback and/or default
processing according to an embodiment of the
invention; and

FIG. 10 is a block diagram of a representative computer
system in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION 

[0020] The invention relates to techniques for converting
source text (e.g., Unicode text) to multiple different
encodings. The invention operates without any font or
style information that could suggest the original encoding
types. The invention is able to intelligently determine
which of a variety of available target encodings are most
appropriate for the given source text. The determination
of the most appropriate target encodings is flexible
enough to accommodate different criteria or tolerance
levels in performing the conversion as may be desired.
The criteria can, for example, be determined according
to the intended use for the converted text, namely printing
or displaying of the converted text. The various tolerance
levels can, for example, include strict, loose or
fallback. The conversion out of Unicode into multiple different
encodings also requires the determination of
where and when to switch between the available target
encodings. The invention can use a variety of different
approaches in deciding where and when to switch
encodings as well as what encoding to switch to.
[0021] Another aspect of the invention pertains to the
automatic identification of those target encodings that 
are available. The identified available target encodings 
are provided to a code conversion system so that the 
source text can be converted to one or more different 
encodings that are known to be available to the system. 
[0022] The Unicode standard is a compilation of characters
from other character encodings developed into a
single, universal, international character encoding
standard. Unicode character encodings
are 16 bits wide. Within this document, Unicode
characters are represented in hexadecimal with a preceding
u (e.g., u0041), and characters in other encodings
are represented in hexadecimal with a preceding x
(e.g., x41 for a 1-byte character, x8140 for a 2-byte
character).
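
The notation can be reproduced with a short sketch. The helper names are hypothetical, and the choice of Shift-JIS and of the ideographic space (U+3000) as the 2-byte example is an assumption made for illustration; the text above does not say which encoding x8140 belongs to.

```python
def unicode_notation(ch: str) -> str:
    """Format a 16-bit Unicode character as 'u' plus four hex digits."""
    return f"u{ord(ch):04X}"


def target_notation(ch: str, encoding: str) -> str:
    """Format a character's code point in a target encoding as 'x' plus hex bytes."""
    return "x" + ch.encode(encoding).hex().upper()


if __name__ == "__main__":
    print(unicode_notation("A"))                   # u0041
    print(target_notation("A", "ascii"))           # x41
    print(target_notation("\u3000", "shift_jis"))  # x8140
```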

[0023] For ease of reference, the following definitions 
are useful in understanding code conversion: 

1. Code Point: A code point is a bit pattern in a particular
encoding. Usually the bit pattern is one or
more bytes long. A Unicode code point is always 16
bits or two bytes.

2. Encoding: An encoding is a one-to-one mapping 
between a set of characters and a set of code 
points. For example, the ASCII encoding maps a
set including a-z, A-Z, and 0-9 to the code points
x00 through x7F.

3. Text Element: A text element is a sequence of 
one or more characters that are treated as a unit for 
a particular operation. For example, LATIN CAPI- 
TAL LETTER U followed by NON-SPACING 
DIAERESIS is a text element (e.g., two adjacent 
characters in this example) for the code conversion 
operation in accordance with the invention. 

4. Fallback: A fallback is a sequence of one or more 
characters in the target encoding that are not 
exactly equivalent to the source characters but 
which preserve some of the information of the orig- 
inal. For example, (C) is a possible fallback for ©. 

5. Default: A default is a sequence of one or more 
characters in the target encoding that are used 
when nothing in the target encoding even resem- 
bles the source code points. 

[0024] The general conversion technique according to
the invention converts source characters to target characters
of multiple different target encodings because
the source characters are not all convertible into a single
target encoding. Preferably, the source characters
are Unicode characters. In the discussion provided 
below, it is assumed that the source characters are Uni- 
code characters. However, the invention is not limited to 
conversion of only Unicode characters. 
[0025] Embodiments of the invention are discussed 
below with reference to FIGs. 1-10. However, those 
skilled in the art will readily appreciate that the detailed 
description given herein with respect to these figures is 
for explanatory purposes as the invention extends 
beyond these limited embodiments. 
[0026] FIG. 1 is a code conversion system 100 
according to an embodiment of the invention. The code 
conversion system 100 operates to convert text from a 
Unicode encoding to multiple target encodings. 
[0027] The code conversion system 100 begins with a
client code conversion request 102. A client application
typically calls the code conversion system 100 requesting
a code conversion of a source text string. The client
code conversion request 102 includes at least the
source text string, which contains a plurality of characters.
The source text string is supplied to a multi-encoding
code converter 104. The multi-encoding code
converter 104 operates to process the source text string
to identify which portions of the source text string are to 
be converted to what particular target encodings. The 
multi-encoding code converter 104 receives a list of 
available target encodings from a target encoding list 
106. Typically, the target encodings contained within the 
target encoding list 106 are ordered in accordance with 
predetermined preferences. For example, the predeter- 
mined preferences could be listed in the target encoding 
list 106 with the most preferred target encoding being 
first in the list. 

[0028] The multi-encoding code converter 104 con- 
verts the source text string into a target string 108 using 
the target encoding list 106. The target string 108 
includes characters in multiple different encodings. The 
multi-encoding code converter 104 also produces a text 
run array 110. The text run array 110 indicates the par- 
ticular target encoding used for particular characters in 
the target string 108. In one embodiment, the text run 
array 110 provides a target encoding for sequential runs
of text within the target string 108. In any event, the 
resulting output target string from the multi-encoding 
code converter contains characters that are encoded in 
multiple different target encodings. 
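
To illustrate how a client might consume the two outputs, the sketch below decodes a target string using a text run array. The representation is an assumption made for illustration: each entry is a hypothetical (byte offset, encoding name) pair marking where a run starts, and Python codec names stand in for target encodings.

```python
def decode_with_runs(target: bytes, text_runs: list[tuple[int, str]]) -> str:
    """Decode a multi-encoding target string.

    text_runs holds (start offset in bytes, encoding name) pairs sorted by
    offset; each run extends to the start of the next run or to the end.
    """
    pieces = []
    for index, (start, encoding) in enumerate(text_runs):
        if index + 1 < len(text_runs):
            end = text_runs[index + 1][0]
        else:
            end = len(target)
        pieces.append(target[start:end].decode(encoding))
    return "".join(pieces)


if __name__ == "__main__":
    target = "abc".encode("latin-1") + "\u05e9\u05dc".encode("iso8859_8")
    text_runs = [(0, "latin-1"), (3, "iso8859_8")]
    print(decode_with_runs(target, text_runs))
```

Without the run array the byte string alone would be ambiguous, since the same byte value can denote different characters in different target encodings.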
[0029] The code conversion system 100 is particularly
suited for use on a computer system. For example, 
when an application program executing on a computer 
system does not support Unicode, an incoming Unicode 
text to the application cannot be properly displayed or 
printed by the application program. In such case, the 
Unicode text needs to be converted out of Unicode into 
some other target encodings that the application and 
the computer system do understand. However, given 
that Unicode includes codes for many of the world's various
character sets (encodings), often a single target
encoding is not able to be utilized for all of the Unicode 
text. Therefore, according to the invention, the code 
conversion system 100 is able to convert the Unicode 
text into multiple target encodings in an intelligent and 
efficient manner. 



[0030] The multi-encoding code converter 104 operates
to determine which of the available target encodings
for a computer system are to be used in the
conversion of the source string into the target string,
which ends up having multiple target encodings. The
multi-encoding code converter 104 can perform such
operations using a variety of approaches. A first approach
permits conversion from Unicode text to multiple runs of
text with different target encodings with a bias toward a
preferred target encoding. A second approach permits
conversion from Unicode text to multiple runs of text
with different target encodings while attempting to minimize
switching between target encodings. The first and
second approaches are representative of the variety of
multi-encoding approaches that the multi-encoding converter
104 can utilize.
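
The difference between the first and second approaches can be sketched as follows. Both functions are illustrative assumptions rather than the processing of the figures: single Python characters stand in for text elements, Python codecs stand in for target encodings, and all names are hypothetical. The second function picks, at each position, the encoding that yields the longest run, which is one way of attempting to minimize switching.

```python
def can_convert(text: str, encoding: str) -> bool:
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def preferred_bias(source: str, encodings: list[str]) -> list[tuple[str, str]]:
    """First approach: every element goes to the most preferred encoding that accepts it."""
    runs: list[tuple[str, str]] = []
    for ch in source:
        enc = next((e for e in encodings if can_convert(ch, e)), None)
        if enc is None:
            raise ValueError(f"no available encoding for {ch!r}")
        if runs and runs[-1][0] == enc:
            runs[-1] = (enc, runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((enc, ch))               # switch encodings
    return runs


def run_length(source: str, start: int, encoding: str) -> int:
    """Number of consecutive elements from start that the encoding can convert."""
    pos = start
    while pos < len(source) and can_convert(source[pos], encoding):
        pos += 1
    return pos - start


def fewest_switches(source: str, encodings: list[str]) -> list[tuple[str, str]]:
    """Second approach: at each position take the encoding giving the longest run."""
    runs: list[tuple[str, str]] = []
    pos = 0
    while pos < len(source):
        # max() returns the first maximum, so ties follow preference order.
        best = max(encodings, key=lambda e: run_length(source, pos, e))
        length = run_length(source, pos, best)
        if length == 0:
            raise ValueError(f"no available encoding for {source[pos]!r}")
        runs.append((best, source[pos:pos + length]))
        pos += length
    return runs


if __name__ == "__main__":
    text = "a\u05d0b"
    print(preferred_bias(text, ["latin-1", "iso8859_8"]))   # three runs
    print(fewest_switches(text, ["latin-1", "iso8859_8"]))  # one run
```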

[0031] Although the multi-encoding converter 104
preferably operates to determine which of the available
target encodings for a computer system are used in the
conversion of the Unicode text into multiple target
encodings, according to a third approach, the multi-encoding
converter 104 can also operate to select a single
target encoding that is well suited for conversion of
the Unicode text even though it will likely not be able to
convert certain portions thereof. Hence, the third
approach is a single-encoding approach. FIG. 6 illustrates
a representative single-encoding approach that
could be utilized in this regard.

[0032] FIG. 2 illustrates a block diagram of an embodiment
of a Unicode code conversion system 200 according
to the invention. The Unicode code conversion
system 200 is, for example, an implementation of the
code conversion system 100 illustrated in FIG. 1. The
Unicode code conversion system 200 operates to convert
characters of a source string into one or more characters
in a target string which are in different character
encodings than the encoding utilized in the source
string. Preferably, the Unicode code conversion system
200 converts from Unicode to different target encodings.

[0033] The Unicode code conversion system 200 
includes a From-Unicode converter 202 which receives 
a Unicode string 204 and produces a target string 206 
and a text run array 207. The target string 206 includes
multiple different target encodings. The text run array
207 specifies where in the target string 206 the different 
encodings occur. In other words, the text run array 207 
indicates the particular target encoding used in convert- 
ing the portions of the Unicode string 204. 

[0034] The From-Unicode converter 202 performs the
code conversion process. In so doing, the From-Unicode
converter 202 interacts with a scanner 208. The
scanner 208 in conjunction with a scanner table 210
scans the Unicode string 204 to identify a text element.
The From-Unicode converter 202 then uses a lookup
handler 212 to look up the one or more characters in the
target encoding for the text element identified by the
scanner 208. The lookup handler 212 uses mapping tables

EP 0 989 499 A2

214 to obtain the one or more characters in the target 
encoding for the text element. In one embodiment, different
mapping tables are provided for each of the different
target encodings available to the Unicode code
conversion system. Additionally, the From-Unicode con- 
verter 202 may also use a fallback handler 216 when 
enabled. The fallback handler 216 operates together 
with the mapping tables 214 to identify one or more 
characters in the target encoding that are able to be 
used as a fallback mapping for the text element in cases 
where the look-up handler 212 has been unable to iden- 
tify one or more characters in the target encoding for the 
text element. A state administrator 218 maintains or 
stores information on the current state of the conver- 
sion. This information may, for example, include context, 
direction and state of symmetric swapping. 
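
The interplay of the lookup handler and the fallback handler can be sketched as below. The table contents and all names are hypothetical, and keeping the fallback data in a separate dictionary is a simplification chosen for readability; the sketch ignores direction, context and symmetric swapping state.

```python
# Hypothetical per-encoding mapping tables: text element -> bytes in the target encoding.
MAPPING_TABLES = {
    "ascii": {"A": b"\x41", "B": b"\x42"},
}

# Hypothetical fallback data: inexact substitutes that keep some of the information.
FALLBACK_TABLES = {
    "ascii": {"\u00a9": b"(C)"},   # COPYRIGHT SIGN -> "(C)"
}


def lookup(element: str, encoding: str):
    """Lookup handler: exact mapping for the element, or None when there is none."""
    return MAPPING_TABLES[encoding].get(element)


def convert_element(element: str, encoding: str, fallbacks_enabled: bool = False):
    """Try the exact mapping first; consult fallbacks only when enabled.

    Returns None when the element cannot be converted into this encoding.
    """
    mapped = lookup(element, encoding)
    if mapped is None and fallbacks_enabled:
        mapped = FALLBACK_TABLES[encoding].get(element)
    return mapped


if __name__ == "__main__":
    print(convert_element("A", "ascii"))
    print(convert_element("\u00a9", "ascii"))
    print(convert_element("\u00a9", "ascii", fallbacks_enabled=True))
```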
[0035] Besides the Unicode string 204, the From-Uni- 
code converter 202 also receives from a client (that has 
called the Unicode code conversion system 200) a tar- 
get encodings list 220 and perhaps calling options 222. 
The target encodings list 220 includes a list of available 
target encodings. The target encodings list 220 is used
by the From-Unicode converter 202 to facilitate conver- 
sion to multiple target encodings that are determined to 
be available. Typically, the target encodings contained 
within the target encodings list 220 are ordered in 
accordance with predetermined preferences. For exam- 
ple, the predetermined preferences could be listed in 
the target encodings list 220 with the most preferred target
encoding being first in the list. The calling options
222 contain client settings for the various options made
available by the Unicode code conversion system 200, such as
fallbacks, tolerances, etc.

[0036] The scanner 208 in conjunction with the scan- 
ner table 210 scans the Unicode string 204 and returns 
the next text element and any additional information 
needed by the look-up handler 212. The additional infor- 
mation includes one or more of direction information, 
context information, and various state indicators. The 
general operation of the scanner 208 is as follows. The 
scanner 208 scans through the characters of the input 
Unicode string 204. If direction information is needed for 
the target encoding, then the character direction is 
obtained for at least the first character in the text ele- 
ment. Also, if character context information is needed 
for the target encoding, then character context informa- 
tion is obtained for at least the first character in the text 
element. Then, as the scanner 208 scans through each 
of the characters, the scanner 208 takes an action for 
the character in accordance with information residing in 
the scanner table 210. The particular action that the 
scanner 208 takes is determined based on state and 
character class. The actions that the scanner 208 can
take include: marking the current character, setting or
clearing the symmetric swapping bit, noting the contextual
form of a text element, setting a flag that indicates
that the text element will need reordering, and indicating
the end of the text element. The symmetric swapping bit,
the context and the direction are saved by the state 
administrator 218 as information pertaining to the state 
of the scanner. Before returning, the scanner 208 saves 
context information for the text element. The scanner
208 returns the text element (each text element within
the input string) and its attributes. The attributes include
the following: direction, symmetric swapping state, and
context. After the scanner 208 determines a text element,
the characters may need to be reordered
into canonical order. As an example, reordering of the
characters within a text element is done when the text 
element includes non-spacing marks that are not in 
canonical order as defined by Unicode. 
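
The reordering step can be illustrated with Python's `unicodedata` module. This is a minimal sketch, assuming the text element consists of one base character followed only by non-spacing marks with non-zero canonical combining classes; the function name is hypothetical.

```python
import unicodedata


def reorder_marks(element: str) -> str:
    """Put the marks that follow the base character into canonical order.

    Marks are sorted by canonical combining class. sorted() is stable, so
    marks of equal class keep their relative order, as canonical ordering
    requires.
    """
    base, marks = element[:1], element[1:]
    return base + "".join(sorted(marks, key=unicodedata.combining))


if __name__ == "__main__":
    # U+0301 (class 230) precedes U+0323 (class 220): not in canonical order.
    element = "a\u0301\u0323"
    reordered = reorder_marks(element)
    print([f"u{ord(c):04X}" for c in reordered])
```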
[0037] Preferably, the scanner 208 together with the
scanner table 210 are implemented as a pair of state
machines that operate in parallel. A first state machine
resolves the character direction, and a second state
machine computes text elements and character form
context information where applicable and also maintains
the symmetric swapping state. By using two separate
state machines, the Unicode code conversion
system 200 is easier to design and maintain. The first
and second state machines can be implemented as
two-dimensional arrays (or tables) indexed by state and
class. In cases where the action the scanner 208 is to
take depends on the character direction, then the state
machine entry is an index into another table which contains
the appropriate action for the scanner 208 to take
for each direction.
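
A table-driven state machine of this kind can be sketched as a two-dimensional array indexed by state and character class. The states, classes and actions below are hypothetical and deliberately minimal: the machine only groups a base character with the combining marks that follow it, and it does not model direction resolution or the per-direction action table.

```python
import unicodedata

# Character classes (column index) and states (row index); both hypothetical.
BASE, MARK = 0, 1
START, IN_ELEMENT = 0, 1

# Actions the scanner can take for a character.
EXTEND = "extend"                   # add the character to the current element
END_THEN_START = "end_then_start"   # close the element, start a new one

# TABLE[state][char_class] -> (action, next_state)
TABLE = [
    # BASE                           MARK
    [(EXTEND, IN_ELEMENT),           (EXTEND, IN_ELEMENT)],   # START
    [(END_THEN_START, IN_ELEMENT),   (EXTEND, IN_ELEMENT)],   # IN_ELEMENT
]


def char_class(ch: str) -> int:
    """Classify a character as a combining mark or a base character."""
    return MARK if unicodedata.combining(ch) else BASE


def scan(text: str) -> list[str]:
    """Split text into text elements by walking the state table."""
    elements, current, state = [], "", START
    for ch in text:
        action, state = TABLE[state][char_class(ch)]
        if action == END_THEN_START:
            elements.append(current)
            current = ""
        current += ch
    if current:
        elements.append(current)
    return elements


if __name__ == "__main__":
    print(scan("U\u0308x"))
```

Changing the chunking behavior then amounts to editing the table rather than the scanning loop.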

[0038] The function of the scanner 208 is to convert
the input Unicode string 204 into text elements and to
return the text elements and their attributes. The scanner
208 needs to save certain characteristics of the text
element so that it can be properly converted in the target
encoding. Namely, the characteristics include the
direction, the context and the symmetric swapping
state. However, the scanner 208 need not know what
the target encoding is because its operation is independent
of the particular target encoding. Nevertheless,
the Unicode conversion system 200 is preferably implemented
such that the definition of a text element (i.e.,
the chunking behavior) could vary with the target encoding
simply by modifying the scanner table 210.
[0039] The directionality of characters is used for
presentation of the characters. For example, when Arabic
or Hebrew text is displayed on a display screen, it
is ordered from right to left. Most Unicode characters
have an implicit direction, see Unicode Version 2.0, at p.
3-14 (Section 3.11) and p. 4-10 (Section 4.3). The
implicit direction classes provided with the Unicode
standard and their values include: Left-Right (0),
Right-Left (1), European Number (2), European Number
Separator (3), European Number Terminator (4), Arabic
Number (5), Common Number Separator (6), Block
Separator (7), Segment Separator (8), Whitespace (9),
and Other Neutrals (10). The scanner 208 looks up the
direction class for characters of the text element. The
direction class is then used to resolve the direction of
the text element. There are also special Unicode characters
which cause overriding or embedding of directionality.
These special direction Unicode characters are
treated by the scanner 208 as single character text elements.
[0040] There are some basic rules that the scanner 208
follows in forming the text elements. The base rule is that
if none of the other rules apply, then the text element is a
single Unicode character. Another rule is that non-spacing or
combining marks following a base character are grouped with
the base character as a single text element. Yet another rule
is that when characters associated with symbols (e.g., Korean
Hangul Jamos characters), ligatures or ideographs are
encountered, they are combined into text elements. Still
another rule is that when a fraction slash is surrounded on
each side by a decimal digit, they are combined as a numeric
fraction text element.
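
The grouping rules above can be illustrated with a minimal
sketch. It is not the scanner 208 or the scanner table 210; the
function name scan_text_elements is hypothetical, only the base
rule, the combining-mark rule and the fraction-slash rule are
covered, and the Hangul/ligature/ideograph rule is left out.
Combining marks are assumed to be the characters whose Unicode
general category starts with "M", and the fraction slash is
assumed to be U+2044.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # assumption: the "fraction slash" is U+2044


def is_mark(ch):
    """True for non-spacing, spacing and enclosing combining marks."""
    return unicodedata.category(ch).startswith("M")


def scan_text_elements(text):
    """Split text into text elements (hypothetical helper, partial rule set)."""
    elements = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1  # base rule: one Unicode character per text element
        # Fraction rule: decimal digit + fraction slash + decimal digit.
        if (text[i].isdecimal() and j + 1 < n
                and text[j] == FRACTION_SLASH and text[j + 1].isdecimal()):
            j += 2
        # Combining rule: trailing marks join the preceding base character.
        while j < n and is_mark(text[j]):
            j += 1
        elements.append(text[i:j])
        i = j
    return elements


# "e" + combining acute, "a", "1/2" written with U+2044, "b"
sample = "e\u0301a1\u20442b"
print(scan_text_elements(sample))
```

Keeping each rule as a separate test on the current position
mirrors the idea that chunking behavior is data that could be
changed per target encoding, although a real scanner would be
table-driven rather than hard-coded.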

[0041] The mapping tables 214 are used by the lookup handler
212 to match an input sequence of one or more Unicode
characters to an output sequence of one or more characters in
a particular target encoding. In addition to the Unicode
sequence (i.e., text element) itself, certain additional
pieces of information about the input sequence are available
(e.g., direction, context, symmetric swapping state, vertical
forms request, fallback request, tolerance, variant), and
some tables make use of this information. Preferably, the
mapping table 214 also stores data needed by the fallback
handler 216, though a separate table could be provided for
use by the fallback handler 216.
[0042] To speed the determination of whether a particular
Unicode character will convert into a particular target
encoding, certain embodiments of the invention can make use
of a conversion availability table. FIG. 3A illustrates a
diagram of a conversion availability table 300 according to
an embodiment of the invention. The conversion availability
table 300 is specific to a particular target encoding and
thus a plurality of different conversion availability tables
are provided. The conversion availability table 300 includes
a header portion 302, an entry list portion 304, and bit
arrays 306. The header portion indicates (i) the number of
entries in the entry list portion and (ii) an offset to the
start of the bit arrays 306. The entry list portion 304 is
indexed by Unicode character codes and provides an indication
whether the Unicode character code is able to be converted
into the particular target encoding. This indication can be
obtained either directly from the entry list portion 304 or
indirectly from one of the bit arrays 306 as explained below
with reference to FIG. 3B.
[0043] FIG. 3B is a block diagram of a representative
conversion availability table 350 for a particular target
encoding. The conversion availability table 350 includes a
header 352 that includes an entry count 354 and a base offset
(OFFSET) 356. The conversion availability table 350 also
includes an entry list portion 360 shown containing
hexadecimal values. More particularly, a first portion 362
contains Unicode character code values and a second portion
364 contains a value that indicates either (i) that the
corresponding Unicode character code values are or are not
convertible into the target encoding, or (ii) an offset into
the bit array section. In this representative embodiment, the
bit array section within the conversion availability table
350 includes a first bit array 366 (BIT-ARRAY-1) at offset
0x0000 in the bit array section and a second bit array 368
(BIT-ARRAY-2) at offset 0x0020 in the bit array section.
[0044] The operation and usage of the conversion 
availability table 350 is explained by the following exam- 
ples. In general, the conversion availability table is used 
to speed the determination of whether a particular Uni- 
code character will convert into a particular target 
encoding. 

[0045] In one example, if the Unicode character "u0021" were
to be checked for convertibility to the particular target
encoding, then the first row of the entry list portion 360
would be the appropriate row since u0021 is within the range
of characters between the first and second rows of the entry
list portion 360 (namely the range u0000 to u007F). Hence,
the value "0xFFFF" in the second portion 364 of the first row
is used to indicate that the Unicode character u0021 can be
converted into the particular target encoding.
[0046] In another example, if the Unicode character "u00A0"
were to be checked for convertibility to the particular
target encoding, then the third row of the entry list portion
360 would be the appropriate row since u00A0 begins the range
of characters between the third and fourth rows of the entry
list portion 360 (namely the range u00A0 to u02BF). Hence,
the value "0x0000" in the second portion 364 of the third row
is used to provide a partial offset to one of the bit arrays.
The partial offset 0x0000 is added to the base offset 356 to
obtain a pointer (location or address) to the first bit array
366 that contains a flag for each Unicode character
indicating the convertibility of each of the Unicode
characters within the range u00A0 to u02BF into the
particular target encoding.

[0047] In still another example, if the Unicode character
"u02C5" were to be checked for convertibility to the
particular target encoding, then the fifth row of the entry
list portion 360 would be the appropriate row since u02C5 is
within the range of characters between the fifth and sixth
rows of the entry list portion 360 (namely the range u02C0 to
u033F). Hence, the value "0x0020" in the second portion 364
of the fifth row is used to provide a partial offset to one
of the bit arrays. The partial offset 0x0020 is added to the
base offset 356 to obtain a pointer (location or address) to
the second bit array 368 that contains a flag for each
Unicode character indicating the convertibility of each of
the Unicode characters within the range u02C0 to u033F into
the particular target encoding.

[0048] In yet another example, if the Unicode character
"u0395" were to be checked for convertibility to the
particular target encoding, then the sixth row of the entry
list portion 360 would be the appropriate row since u0395 is
within the range of characters following the sixth (and last)
row of the entry list portion 360 (namely the range u0340 to
uFFFF). Hence, the value "0xFFFE" in the second portion 364
of the sixth row is used to indicate that the Unicode
character u0395 cannot be converted into the particular
target encoding. The values 0xFFFF and 0xFFFE within the
second portion 364 of the entry list portion 360 can also be
considered a convertibility flag for their associated range
of character codes.
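
The lookups in the preceding examples can be sketched as
follows. This is an illustration under stated assumptions, not
the layout of FIG. 3B: the class name, the row layout, the
sample bit arrays and the bit order (least significant bit
first) are all hypothetical, and a dictionary keyed by partial
offset stands in for the arithmetic "base offset plus partial
offset".

```python
from bisect import bisect_right

CONVERTIBLE = 0xFFFF      # every code in the row's range converts
NOT_CONVERTIBLE = 0xFFFE  # no code in the row's range converts


def make_bit_array(first, last, convertible_codes):
    """Pack one flag per code in [first, last]; bit 0 of byte 0 is `first`."""
    bits = bytearray((last - first + 8) // 8)
    for code in convertible_codes:
        index = code - first
        bits[index // 8] |= 1 << (index % 8)
    return bytes(bits)


class ConversionAvailabilityTable:
    """Hypothetical per-encoding table: range starts plus flag or offset."""

    def __init__(self, entries, bit_arrays):
        # entries: sorted (first code of range, value) pairs
        self.starts = [start for start, _ in entries]
        self.values = [value for _, value in entries]
        self.bit_arrays = bit_arrays  # partial offset -> packed flags

    def can_convert(self, code):
        row = bisect_right(self.starts, code) - 1  # last row starting <= code
        if row < 0:
            return False
        value = self.values[row]
        if value == CONVERTIBLE:
            return True
        if value == NOT_CONVERTIBLE:
            return False
        bits = self.bit_arrays[value]      # value is a partial offset
        index = code - self.starts[row]    # position within the row's range
        return bool((bits[index // 8] >> (index % 8)) & 1)


# Illustrative rows only; the convertible codes are made up.
table = ConversionAvailabilityTable(
    entries=[
        (0x0000, CONVERTIBLE),
        (0x0080, NOT_CONVERTIBLE),
        (0x00A0, 0x0000),
        (0x02C0, 0x0020),
        (0x0340, NOT_CONVERTIBLE),
    ],
    bit_arrays={
        0x0000: make_bit_array(0x00A0, 0x02BF, {0x00A0, 0x00E9}),
        0x0020: make_bit_array(0x02C0, 0x033F, {0x02C6}),
    },
)
print(table.can_convert(0x0021), table.can_convert(0x0395))
```

Whole-range flags keep the common cases to a single binary
search, and only mixed ranges pay for a bit test, which is the
speed-up the table is described as providing.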

[0049] As noted above with respect to FIG. 2, the 
From-Unicode converter 202 makes use of the target
encoding list 220 in determining what particular target 
encodings should be considered when converting from 
Unicode to other character encodings that are more 
desirable. The target encoding list 220 specifies the tar- 
get encodings that are supported by a particular compu- 
ter system (e.g., platform or operating system). While 
the target encoding list 220 could be manually produced 
and supplied to the computer system on which the code
conversion associated with the invention is to be performed,
it is advantageous to automatically generate
the target encoding list by automatically interacting with 
the computer system. The automatic generation of a 
target encoding list is described below with respect to 
FIG. 4. 

[0050] FIG. 4 is a flow diagram of target encoding list
creation processing 400 according to an embodiment of the
invention. The target encoding list creation processing 400
initially retrieves 402 installed scripts (locales) from the
operating system of the computer system. Normally, the
installed scripts (locales) would be identified by script
(locale) codes. A script (system script) or locale is a
collection of software facilities that provide for basic
differences between writing systems and language preferences.
Examples are character sets, fonts, date and number
formatting, text collation, etc. A script or locale code is a
number indicating a particular script or locale in a
particular system. Typically, the scripts (locales) obtained
from an operating system are in a preference order. However,
if the retrieved scripts (locales) are not in a preference
order, then they can be rearranged into a predetermined
preference order. Once the installed scripts are retrieved
402, the installed scripts are placed 404 in a target
encoding list in accordance with the preference order; this
forms the initial target encoding list. Next, installed fonts
are retrieved 406 from the operating system. Here, the target
encoding list creation processing 400 requests and then
retrieves the installed fonts that are available from the
operating system. Then, certain of the installed fonts are
added 408 to the target encoding list. Typically, the
installed fonts that are added 408 are those that are either
symbol fonts or fonts that will yield a variant text encoding
from one of the installed scripts or symbol fonts. Following
block 408, the target encoding list creation processing 400
is complete and ends. At this point, the target encoding list
has been created and is available for subsequent use by the
multi-encoding converter.
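
A minimal sketch of blocks 402 through 408 is given here,
assuming the scripts and fonts have already been retrieved from
the operating system and are passed in as plain lists. The
function name, the shape of the variant lookup (font name to
script and variant name) and the rule of placing a variant
directly after its base script are assumptions made for
illustration; the sample names are taken from the representative
example in this description.

```python
def create_target_encoding_list(installed_scripts, installed_fonts,
                                variant_table, symbol_fonts):
    """Build a target encoding list (hypothetical helper)."""
    # Blocks 402/404: scripts go in first, in preference order.
    targets = list(installed_scripts)
    # Blocks 406/408: add symbol fonts and fonts that yield a variant.
    for font in installed_fonts:
        if font in symbol_fonts:
            if font not in targets:
                targets.append(font)
        elif font in variant_table:
            script, variant = variant_table[font]
            name = f"{script} + {variant} variant"
            if script in targets and name not in targets:
                # Assumed placement: right after the non-variant encoding.
                targets.insert(targets.index(script) + 1, name)
    return targets


scripts = ["MacRoman", "MacJapanese", "MacKorean"]
fonts = ["Geneva", "Chicago", "Times", "Osaka",
         "ChuGothic", "Seoul", "Dingbats", "VT100"]
variants = {
    "ChuGothic": ("MacJapanese", "Postscript"),
    "Baghdad": ("MacArabic", "TrueType"),
}
symbols = {"Dingbats", "VT100"}

for encoding in create_target_encoding_list(scripts, fonts, variants, symbols):
    print(encoding)
```

Fonts that are neither symbol fonts nor variants of an installed
script contribute nothing, which is why ordinary text fonts such
as Geneva do not appear in the resulting list.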
[0051] A representative example of the operation of the
target encoding list creation processing 400 is provided
below. First, assume that in response to a request for the
installed scripts on the computer system, the operating
system returns three (3) scripts in their order of
preference. The target encoding list would then include these
scripts in their order of preference and be represented by
the following table.

Target Encoding List
MacRoman
MacJapanese
MacKorean

[0052] Next, the installed fonts on the computer system are
retrieved. Assume here that there are eight (8) installed
fonts in the computer system as provided in the following
table.

Installed Fonts
Geneva
Chicago
Times
Osaka
ChuGothic
Seoul
Dingbats
VT100

[0053] Then, certain of the installed fonts are added to the
target encoding list. As a result, the representative target
encoding list would be represented by the following table.

Target Encoding List
MacRoman
MacJapanese
MacJapanese + Postscript variant
MacKorean
Dingbats
VT100



[0054] Note that the target encoding "MacJapanese +
Postscript variant" is provided because the installed fonts
contained the font "ChuGothic" which is a variant font on
MacJapanese. Hence, the "MacJapanese + Postscript variant"
was added to the target encoding list just after its
non-variant version. Also, the fonts "Dingbats" and "VT100"
are known to be symbol fonts and thus were also added to the
target encoding list. In one embodiment, the target encoding
list creation processing 400 is able to determine those of
the installed fonts that are variant fonts or symbol fonts by
using a variant table and a symbol table stored within the
computer system. For example, for the representative
embodiment, the variant table and the symbol table could be
as follows.



Variant Table
Script          Font            Variant
MacJapanese     ChuGothic       Postscript
MacJapanese     TohabaGothic    Basic
MacArabic       Baghdad         TrueType
MacHebrew       RamatSharon     FigureSpace

Symbol Table
Symbol
Dingbats
VT100

[0055] FIG. 5 is a flow diagram of code conversion processing
500 according to an embodiment of the invention. The code
conversion processing 500 is invoked when a request for code
conversion from Unicode to other target encodings besides
Unicode is made. The code conversion processing 500 initially
obtains 502 a copy of the target encoding list. Here, the
target encoding list resides in the memory of the computer
system and is for general use by various applications on the
computer system. Hence, each individual application can
obtain its own copy of the target encoding list to perform
its code conversion processing.

[0056] A decision block 504 then determines whether the
application has an encoding preference. When the decision
block 504 determines that the application has an encoding
preference that differs from the system encoding, then a copy
of the target encoding list is adjusted 506 to place the
application's preferred encoding at the top of the target
encoding list. Alternatively, when the decision block 504
determines that the application does not have an encoding
preference different from the system encoding, then block 506
is bypassed.

[0057] For example, if the obtained copy of the target
encoding list was as follows,

Target Encoding List
MacRoman
MacJapanese
MacJapanese + Postscript variant
MacKorean
Dingbats
VT100

and knowing that the application preference is for
"MacJapanese", then the copy of the target encoding list is
adjusted 506 to move "MacJapanese" to the top of the target
encoding list. The adjusted target encoding list is then as
follows.

Target Encoding List
MacJapanese
MacRoman
MacJapanese + Postscript variant
MacKorean
Dingbats
VT100



[0058] Following block 506, or following the decision block
504 when the application does not have an encoding
preference, a decision block 508 determines whether the
client (the caller for the code conversion) has a preferred
encoding. When the decision block 508 determines that the
client does have a preferred encoding, the copy of the target
encoding list can be further adjusted 510 to place the
client's preferred encoding at the top of the target encoding
list. Alternatively, when the decision block 508 determines
that the client does not have a preferred encoding, then
block 510 is bypassed.
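
The two adjustments (blocks 506 and 510) amount to moving an
entry to the front of a private copy of the list. The sketch
below assumes hypothetical function names and assumes that a
preference not present in the list is simply ignored, a case
the description does not address.

```python
def move_to_top(target_list, preferred):
    """Return a copy of the list with `preferred` first, if it is present."""
    adjusted = list(target_list)
    if preferred is not None and preferred in adjusted:
        adjusted.remove(preferred)
        adjusted.insert(0, preferred)
    return adjusted


def prepare_target_list(system_list, app_preference=None,
                        client_preference=None):
    """Blocks 502 through 510 on a private copy (hypothetical helper)."""
    working = list(system_list)                        # block 502
    working = move_to_top(working, app_preference)     # blocks 504/506
    working = move_to_top(working, client_preference)  # blocks 508/510
    return working


system_list = ["MacRoman", "MacJapanese", "MacJapanese + Postscript variant",
               "MacKorean", "Dingbats", "VT100"]
print(prepare_target_list(system_list, app_preference="MacJapanese"))
```

Because the client adjustment runs last, a client preference
ends up ahead of the application preference, and the shared
system list is never modified.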
[0059] Following block 510, or following the decision block
508 when the client does not have a preferred encoding, a
code converter is called 512. The code converter is then able
to perform the conversion of Unicode text into the target
encodings that are supported by the application in accordance
with the order of preference. For example, with respect to
FIG. 1, the Unicode text and the adjusted copy of the target
encoding list would be utilized by the multi-encoding code
converter 104 to produce the target string 108 containing
multiple different encodings and the resulting text run array
110 indicating where the multiple different encodings are
used within the target string 108.
[0060] As noted above, the multi-encoding code converter can
operate using a number of different approaches. An embodiment
of the single-encoding approach is described below with
respect to FIG. 6, and embodiments of the first and second
(multi-encoding) approaches are described below with respect
to FIGs. 7-9.

[0061] FIG. 6 is a flow diagram of longest target encoding
run processing 600 according to an embodiment of the
invention. The longest target encoding run processing 600
can, for example, be performed by the multi-encoding code
converter 104 illustrated in FIG. 1 or the From-Unicode
converter 202 illustrated in FIG. 2 when using the third
approach for conversion of Unicode text into a single target
encoding. While the longest target encoding run processing
600 does consider multiple target encodings, only one target
encoding is eventually selected and used.

[0062] The longest target encoding run processing 600 begins
with initialization of some variables. Specifically, a best
target encoding is set 601 to the first encoding in the
target encoding list, and a maximum run length is set 601 to
zero (0). Then, an initial target encoding is selected 602
from a target encoding list. Typically, the initial target
encoding in the target encoding list is the most preferred
target encoding for the computer system. Next, an initial
character is selected 604 in the Unicode text block. Then,
the selected character is looked-up 606 in a conversion
availability table. The result of the look-up 606 is a flag
or value that indicates whether or not conversion to the
target encoding is available. The conversion availability
table is, for example, the conversion availability table 300
illustrated in FIG. 3A or the conversion availability table
350 illustrated in FIG. 3B. The conversion availability table
yields a rapid decision on whether conversion to the target
encoding is available.

[0063] Next, a decision block 608 determines whether
conversion is available. When the decision block 608
determines that conversion of the character into the selected
target encoding is not available, then a decision block 610
determines whether the run length is greater than a maximum
run length. Here, the run length is the number of characters
that have been successively determined to be able to be
converted into the selected target encoding. When the
decision block 610 determines that the run length is greater
than the maximum run length, then the maximum run length is
set 612 to the run length. Thus, block 612, in effect,
updates the maximum run length to the run length of the
current target encoding run when it is longer than the
previous maximum. Next, a best target encoding is set 614 to
the selected target encoding. In other words, the best target
encoding is the target encoding that yields the longest
target encoding run. Alternatively, when the decision block
610 determines that the run length is not greater than the
maximum run length, then blocks 612 and 614 are bypassed.

[0064] Following the block 614 or following the deci- 
sion block 610 when the run length does not exceed the 
maximum run length, a decision block 616 determines 
whether there are more target encodings to be consid- 
ered. When the decision block 616 determines that 
there are no more target encodings to be considered, 
the longest target encoding run processing 600 is com- 
plete and ends. 

[0065] Alternatively, when the decision block 616 
determines that there are more target encodings to be 
considered, the run length is initially set 618 to zero (0) 
for the processing of another run length. Following block 
618, the longest target encoding run processing 600 
returns to repeat the block 602 and subsequent blocks 
so that the available run length for other target encod- 
ings can be processed. Here, when repeating block 
602, a next target encoding within the target encoding 
list is selected 602. Then, the initial character in the Uni- 
code text block would be selected 604 and looked-up 
606 in the conversion availability table in the manner 
previously discussed. 

[0066] On the other hand, when the decision block 
608 determines that conversion of the selected charac- 
ter is available, then the run length is incremented 620. 
Next, a decision block 622 determines whether there 
are more characters in the Unicode text block to be 
processed. When the decision block 622 determines 
that there are more characters in the Unicode text block 
to be processed, the longest target encoding run 
processing 600 returns to repeat the block 604 and sub- 
sequent blocks so that subsequent characters in the 
Unicode text block are able to be selected 604 and
subsequently processed. Alternatively, when the decision
block 622 determines that there are no more characters in
the Unicode text block to be processed, the longest target
encoding run processing 600 returns to repeat the block 610
so that other encodings can be considered.
[0067] The result of the longest target encoding run
processing 600 is the identification of a best target
encoding for converting the Unicode text block to one of the
available target encodings. According to the longest target
encoding run processing 600, the best target encoding is the
target encoding that is not only supported by the computer
system but also achieves the longest run length (i.e., the
longest successive number of characters that can be converted
to the target encoding). The best target encoding is
predicted to be the best choice of the available target
encodings, but some portions of the Unicode text block may
not convert into the best target encoding.
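
The flow of FIG. 6 can be sketched as a pair of nested loops.
The function name is hypothetical, and the availability
predicates are toy stand-ins (simple code point ranges) rather
than real conversion availability tables for MacRoman,
MacJapanese or MacKorean.

```python
# Toy availability rules, for illustration only.
TOY_AVAILABILITY = {
    "MacRoman": lambda cp: cp < 0x80,
    "MacJapanese": lambda cp: (cp < 0x80 or 0x3040 <= cp <= 0x30FF
                               or 0x4E00 <= cp <= 0x9FFF),
    "MacKorean": lambda cp: cp < 0x80 or 0xAC00 <= cp <= 0xD7A3,
}


def can_convert(ch, encoding):
    """Stand-in for the conversion availability look-up (block 606)."""
    return TOY_AVAILABILITY[encoding](ord(ch))


def longest_target_encoding_run(text, target_list):
    """Pick the encoding with the longest convertible run from the start."""
    best = target_list[0]          # block 601
    max_run = 0                    # block 601
    for encoding in target_list:   # blocks 602/616
        run = 0                    # block 618
        for ch in text:            # blocks 604/622
            if not can_convert(ch, encoding):  # blocks 606/608
                break
            run += 1               # block 620
        if run > max_run:          # block 610
            max_run = run          # block 612
            best = encoding        # block 614
    return best, max_run


targets = ["MacRoman", "MacJapanese", "MacKorean"]
print(longest_target_encoding_run("abc\u65e5\u672c\u8a9edef", targets))
```

The strict comparison in block 610 means that ties go to the
encoding that appears earlier in the target encoding list, so
the preference order still matters when several encodings cover
the same run.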

[0068] The first and second approaches noted above are
multi-encoding approaches that permit runs in different
target encodings when converting out of Unicode. The first
and second approaches thus allow the conversion to be carried
out such that different portions of the Unicode text block
are converted to different target encodings. The first
approach allows switching to different target encodings
during the code conversion processing but prefers to remain
in a particular target encoding. As an example, the first
approach is suitable for an English speaker who desires to
have an application display as much of the Unicode text block
as possible in a Roman based target encoding. The second
approach also allows switching to different target encodings
during the code conversion processing but prefers to minimize
the number of times the target encoding is changed. As an
example, the second approach is suitable for a user of an
application that desires to be able to efficiently print
documents and thus minimizes the switching between target
encodings and thus fonts.

[0069] FIG. 7 is a flow diagram of multiple target encoding
selection processing 700 according to an embodiment of the
invention. The multiple target encoding selection processing
700 is, for example, performed by the multi-encoding code
converter 104 illustrated in FIG. 1 or the From-Unicode
converter 202 illustrated in FIG. 2 when using the first or
second approaches for conversion of Unicode text into
multiple different target encodings.

[0070] The multiple target encoding selection processing 700
initially sets 701 some variables to initial states.
Specifically, a current encoding is set 701 to the first
encoding in the target encoding list, and a run count is set
701 to zero (0). Then, a first target encoding is selected
702 from the target encoding list. Typically, the first
target encoding is the most preferred target encoding for use
on a computer system. Next, an initial character in the
Unicode text block is selected 704. Then, the selected
character is looked-up 706 to determine whether the selected
character is able to be converted into the target encoding.
This look-up 706 is referred to as a look-up process. In one
embodiment, the selected character is looked-up 706 in a
mapping table by a look-up handler, such as the lookup
handler 212 and the mapping table 214 illustrated in FIG. 2.
[0071] Next, decision block 708 determines whether the
look-up process returned an error. When the look-up process
does return an error, it is understood that the selected
character cannot be converted into the selected target
encoding. Hence, the multiple target encoding selection
processing 700 operates to thereafter determine whether other
available target encodings are successful. Here, the first
target encoding is selected 709 in the target encoding list.
Then, a decision block 710 determines whether the selected
target encoding is equal to the current target encoding. More
particularly, the decision block 710 determines whether the
selected target encoding is the same encoding that was
already previously used unsuccessfully in attempting to
determine whether the selected character can be converted.
This can happen when the look-up process does not always
begin at the beginning of the target encoding list but
instead biases the look-up process towards one of the target
encodings other than the first in the target encoding list.
For example, the bias could be towards a particular encoding
within the target encoding list, which would cause the
processing 700 to always first try that target encoding for
subsequent characters being converted, and if that failed,
then proceed through the target encoding list from the top.
In this example, the particular target encoding in the target
encoding list that was already initially tried due to the
biasing is effectively skipped. In any case, when the
selected target encoding does not match the current target
encoding, the selected character is looked-up 712 using the
selected target encoding. Following block 712, a decision
block 714 determines whether an error occurred during the
look-up process. The error indicates that the character is
unable to be converted to the selected target encoding.
[0072] When the decision block 714 determines that
an error has occurred, then processing proceeds to a 
decision block 716. Also, when the decision block 710 
determines that the selected target encoding is equal to 
the current target encoding, then the decision block 716 
is also performed because blocks 712 and 714 are 
bypassed. The decision block 716 determines whether 
there are more target encodings to be considered. 
When the decision block 716 determines that there are 
more target encodings to be considered, a next target 
encoding is selected 718 from the target encoding list. 
Following block 718, the multiple target encoding selec- 
tion processing 700 returns to repeat the decision block 
710 and subsequent blocks. 

[0073] The processing associated with blocks 710- 
718 continues until either all of the target encodings 
have been attempted or one of the target encodings is 
able to convert the selected character into the selected 
target encoding. In this regard, when the decision block 
716 determines that there are no more target encodings 
to be considered, a decision block 720 determines 
whether the fallback option is enabled. For example, the 
fallback option can be enabled or disabled by the calling 
options 222. When the decision block 720 determines that
the fallback option is disabled, then an error condition is
noted 722 and the multiple target encoding selection
processing 700 is complete and ends. Alternatively, when the
decision block 720 determines that the fallback option is
enabled, then fallback and/or default processing is
performed 724. The fallback and/or default processing 724 is
described below with respect to FIG. 9.
[0074] Following block 714 when the look-up process does not
return an error, in the case of the second approach the
current encoding is set 726 to the selected encoding to bias
the encoding choice for the subsequent characters to the
encoding last used. In the case of the first approach, block
726 would be bypassed.
[0075] Following the block 724 or 726, as well as directly
following the decision block 708, update text run array
processing 728 is performed. The update text run array
processing 728 is described below with respect to FIG. 8.
[0076] Following the block 728, a decision block 730
determines whether there are more characters in the Unicode
text block to be processed. When the decision block 730
determines that there are more characters in the Unicode text
block to be processed, the multiple target encoding selection
processing 700 returns to repeat the block 704 and subsequent
blocks. When this repeating occurs, the block 704 operates to
select the next character in the Unicode text block. The next
target encoding chosen is biased towards a particular
preferred target encoding in the first approach, and is
biased towards the same encoding as previously used in the
second approach. Alternatively, when the decision block 730
eventually determines that there are no more characters in
the Unicode text block to be converted, the multiple target
encoding selection processing 700 is complete and ends.
[0077] When the multiple target encoding selection processing
700 and the obtaining of the code conversions are completed,
the target string stores the characters of multiple different
encodings and the text run array stores the associations of
these characters to the particular multiple target encodings.
An example of a representative text run array for a Unicode
text block of fifty (50) characters is as follows.

Text Run Array
MacRoman / 0
MacJapanese / 10
MacRoman / 14
Dingbats / 44
MacRoman / 45

[0078] In this example, the first Unicode characters are
converted to ten (10) bytes in MacRoman encoding, the next
Unicode characters are converted to four (4) bytes in
MacJapanese encoding, the next Unicode characters are
converted to thirty (30) bytes in MacRoman encoding, the next
Unicode character is converted to one (1) byte in Dingbats
(symbol), and the remaining Unicode characters are converted
to MacRoman encoding. A character in Unicode is two (2)
bytes. This example of the text run array pertains to the
first approach which biases the target encoding to MacRoman.

[0079] The same example in the case of the second approach
might produce the following text run array if the later
MacRoman characters are also representable in MacJapanese,
which is often the case. Here, the second approach attempts
to minimize changes in the encoding by remaining in the last
used encoding as long as possible.



Text Run Array 
MacRoman / 0 
MacJapanese / 10 
Dingbats / 44 
MacRoman / 45 
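
The difference between the first and second approaches can be
sketched with a single flag. The function and exception names
are hypothetical, the availability predicates are toy code
point ranges rather than real mapping tables, fallback
processing (block 724) is omitted, and the text run array is
modeled as a list of (encoding, offset) pairs.

```python
class ConversionError(Exception):
    """No target encoding can represent a character (block 722)."""


# Toy availability rules, for illustration only.
TOY_AVAILABILITY = {
    "MacRoman": lambda cp: cp < 0x80,
    "MacJapanese": lambda cp: (cp < 0x80 or 0x3040 <= cp <= 0x30FF
                               or 0x4E00 <= cp <= 0x9FFF),
    "Dingbats": lambda cp: 0x2700 <= cp <= 0x27BF,
}


def can_convert(ch, encoding):
    """Stand-in for the look-up process (blocks 706 and 712)."""
    return TOY_AVAILABILITY[encoding](ord(ch))


def select_encodings(text, target_list, sticky=False):
    """Return the text run array; sticky=True models the second approach."""
    current = target_list[0]                  # block 701
    runs = []
    for offset, ch in enumerate(text):        # blocks 704/730
        used = None
        if can_convert(ch, current):          # blocks 706/708
            used = current
        else:
            for candidate in target_list:     # blocks 709, 716, 718
                if candidate == current:      # block 710: already tried
                    continue
                if can_convert(ch, candidate):  # blocks 712/714
                    used = candidate
                    break
        if used is None:
            raise ConversionError(f"no encoding for U+{ord(ch):04X}")
        if sticky:
            current = used                    # block 726 (second approach)
        if not runs or runs[-1][0] != used:   # block 728
            runs.append((used, offset))
    return runs


TARGETS = ["MacRoman", "MacJapanese", "Dingbats"]
SAMPLE = "abc\u65e5\u672cdef\u2702gh"
print(select_encodings(SAMPLE, TARGETS))               # first approach
print(select_encodings(SAMPLE, TARGETS, sticky=True))  # second approach
```

With the sample text, the first approach returns to MacRoman as
soon as it can, while the second approach stays in MacJapanese
for the ASCII letters that follow the Japanese characters and so
produces one run fewer.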



[0080] FIG. 8 is a flow diagram of update text run array 
processing 800 according to an embodiment of the 
invention. The update text run array processing 800 
operates to produce a text run array that is utilized by 
the Unicode conversion system to convert the Unicode 
text block into the target text block having multiple target 
encodings. In this embodiment, the text run array is 
updated an entry at a time. 

[0081] The update text run array processing 800 
begins with a decision block 802. The decision block 
802 determines whether the run count is at zero (0) or 
whether the current target encoding is not equal to the 
previous text run array element. When either the run 
count is zero (0) or the current target encoding is not 
equal to the previous text run array element (which sig- 
nifies or identifies the last target encoding used), then a 
decision block 804 determines whether the run count is 
less than a maximum number of runs. The maximum 
number of runs designates the number of different runs 
of different target encodings that is permitted by the 
code conversion system for conversion of the Unicode 
text block. When the decision block 804 determines that 
the run count is not less than the maximum number of 
runs, then an error is returned 806 that specifies that the 
text run array is full. Following block 806, the update text
run array processing 800 returns without having 
updated the text run array. 

[0082] On the other hand, when the decision block 804
determines that the run count is less than the maximum number
of runs, then the current target encoding and the current
offset (i.e., offset into the Unicode text block) are stored
808 in the text run array. The offset is determined by a run
length that identifies the number of characters in a run that
can be successively converted to the current target encoding.
Then, the run count is incremented 810. Following block 810,
the update text run array processing 800 returns after having
modified the text run array to include information on a run
within the Unicode text block.

[0083] In another case, when the decision block 802
determines that the run count is not zero (0) and that the
current target encoding matches the previous text run
array element (i.e., the last target encoding used), then
the update text run array processing 800 can return
immediately without updating the text run array. In this
case, there is no need to add an additional entry
because the last entry in the text run array already
represents the current run, and thus the character currently
being processed forms part of the same run. In this embodiment,
the text run array simply stores the target encoding for a
new run and the offset into the Unicode text block for the
start of the new run. Hence, the characters between the offsets
of consecutive entries form a run that uses the target encoding
identified by the earlier of the two entries.
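
The update text run array processing 800 can be illustrated with a minimal sketch. The class name, the exception, the tuple layout of an entry, and the encoding names in the usage lines are assumptions made for illustration only; the description specifies just the decisions of blocks 802 and 804 and the store and increment of blocks 808 and 810.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class TextRunArrayFull(Exception):
    """Raised when no further runs can be recorded (block 806)."""


@dataclass
class TextRunArray:
    max_runs: int
    # Each entry is (target encoding, offset of the run's first character).
    runs: List[Tuple[str, int]] = field(default_factory=list)

    def update(self, current_encoding: str, current_offset: int) -> bool:
        """Record a new run if one starts here; return True if an entry was added."""
        # Block 802: an entry is needed for the very first run, or when the
        # encoding differs from the one stored in the last entry.
        starts_new_run = not self.runs or self.runs[-1][0] != current_encoding
        if not starts_new_run:
            return False  # the current run simply continues
        # Block 804: enforce the maximum number of runs.
        if len(self.runs) >= self.max_runs:
            raise TextRunArrayFull("text run array is full")
        # Blocks 808 and 810: store the entry; len(self.runs) is the run count.
        self.runs.append((current_encoding, current_offset))
        return True


# Hypothetical encoding names, one per character of a four-character block.
array = TextRunArray(max_runs=4)
for offset, encoding in enumerate(["enc_a", "enc_a", "enc_b", "enc_a"]):
    array.update(encoding, offset)
print(array.runs)
```

Because only the start of each run is stored, the length of a run is recovered by subtracting its offset from the offset of the next entry (or from the end of the text block for the last run).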

[0084] FIG. 9 is a flow diagram of fallback and/or 
default processing 900 according to an embodiment of 
the invention. The fallback and/or default processing 
900 is, for example, performed by the fallback and/or 
default processing 724 illustrated in FIG. 7. Also, the 
fallback and/or default processing 900 is, for example, 
performed by the multi-encoding code converter 104 
illustrated in FIG. 1 or the From-Unicode converter 202 
illustrated in FIG. 2. The client of the code conversion 
system is able to control (e.g., calling options) whether 
fallback and/or default processing is performed. The fall- 
back and/or default processing results in a loss of preci- 
sion in the conversion process, though more 
conversions are possible. Fallbacks are used by the 
code conversion system to be able to convert Unicode 
characters into characters of the target encodings while 
permitting some degree of flexibility to obtain conver- 
sion of more characters. 

[0085] The fallback and/or default processing 900 ini- 
tially looks up 902 the selected character using the 
selected target encoding with fallbacks. Again, the look- 
up of the selected character is referred to as the look-up 
process. A decision block 904 then determines whether 
an error occurred during the look-up process. When the 
decision block 904 determines that an error did not 
occur, then the look-up 902 was able to successfully 
convert the selected character and thus the fallback 
and/or default processing 900 returns. When the fall- 
back and/or default processing 900 returns, the 
processing of the multiple target encoding selection 
processing 700 continues with the block 728 where the 
text run array is updated. 

[0086] On the other hand, when the decision block
904 determines that an error did occur during the look-up
process, then the fallback and/or default processing
900 needs to continue so that a suitable target encoding
can be found for the selected character of the Unicode
text block. The first target encoding is selected 906 in
the target encoding list so that the search for the suitable
target encoding begins at the top of the target
encoding list. A decision block 908 then determines
whether the selected target encoding matches the current
target encoding. When the selected target encoding
does not match the current target encoding, the
selected character is looked-up 910 using the selected
target encoding with fallbacks. Next, a decision block
912 determines whether an error occurred during the
look-up process. When the decision block 912 determines
that an error has not occurred, in the case of the
second approach, the current encoding is set 913 to the
selected encoding to bias the encoding choice for the
subsequent characters to the encoding last used. In the
case of the first approach, block 913 would be
bypassed. Following block 913, the fallback and/or
default processing 900 is complete and returns because
the look-up 910 was able to successfully convert the
selected character using fallbacks.

[0087] Alternatively, when the decision block 912
determines that an error did occur during the look-up
process, then a decision block 914 is performed. The
decision block 914 is also performed when the decision
block 908 determines that the selected target encoding
matches the current target encoding. When the
selected target encoding matches the current target
encoding (i.e., last used target encoding within the text
run array), then the look-up process is known to fail
because it has already been unsuccessfully tried. Thus,
the decision block 908 improves efficiency of the fallback
and/or default processing 900 by skipping the
block 910 (the look-up process) and the block 912.

[0088] In any event, the decision block 914 determines
whether there are additional target encodings to
be considered. When the decision block 914 determines
that there are additional target encodings to be
considered, the next target encoding is selected 916
from the target encoding list. Following block 916, the
fallback and/or default processing 900 returns to repeat
the decision block 908 and subsequent blocks so that
the conversion of the selected character into the
selected target encoding can be again attempted via the
look-up process. On the other hand, when the decision
block 914 determines that there are no more target
encodings to be considered, a default fallback character
is able to be used 918 for the selected target encoding.
Following block 918, the fallback and/or default processing
900 is complete and returns.
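
The flow of the fallback and/or default processing 900 can be sketched as follows. This is an illustrative sketch only: the function and parameter names, the convention that a failed look-up returns None, the "?" default character, and the choice to emit that default in the current encoding are all assumptions that the description does not fix.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical look-up with fallbacks: returns converted bytes, or None on error.
Lookup = Callable[[str, str], Optional[bytes]]


def fallback_or_default(
    char: str,
    current: str,
    encodings: List[str],
    lookup_with_fallbacks: Lookup,
    default: bytes = b"?",
    bias_to_last_used: bool = True,
) -> Tuple[str, str, bytes]:
    """Return (encoding used, new current encoding, converted bytes)."""
    # Blocks 902 and 904: try the current encoding with fallbacks first.
    converted = lookup_with_fallbacks(char, current)
    if converted is not None:
        return current, current, converted
    # Blocks 906, 914 and 916: walk the target encoding list from the top.
    for candidate in encodings:
        # Block 908: the current encoding has already failed, so skip it.
        if candidate == current:
            continue
        # Blocks 910 and 912: look up with fallbacks in the candidate.
        converted = lookup_with_fallbacks(char, candidate)
        if converted is not None:
            # Block 913 (second approach only): bias later characters
            # toward the encoding that was just used.
            new_current = candidate if bias_to_last_used else current
            return candidate, new_current, converted
    # Block 918: no encoding can represent the character; use a default.
    return current, current, default


# Hypothetical mapping tables for two made-up encodings.
TABLES = {
    "enc_a": {"a": b"\x61"},
    "enc_b": {"a": b"\x61", "\u00e9": b"\x8e"},
}


def demo_lookup(char: str, encoding: str) -> Optional[bytes]:
    return TABLES[encoding].get(char)


print(fallback_or_default("\u00e9", "enc_a", ["enc_a", "enc_b"], demo_lookup))
```

The `bias_to_last_used` flag stands in for the choice between the first approach (block 913 bypassed) and the second approach (block 913 performed).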

[0089] Although the look-up operations of the blocks
712 and 910 are performed by the look-up handler (and
perhaps the fallback handler) which use the mapping
table for the associated target encoding, the look-up
could also perform a preliminary check as to whether
conversion is likely to be possible for a particular character.
As an example, the quick check can be achieved
by a conversion availability table (for the associated target
encoding) such as illustrated in FIGs. 3A and 3B.
Then, if the preliminary check were unsuccessful,
the look-up process for the particular character could be
bypassed. Such a preliminary check may be able to
improve average code conversion times.
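
One way such a preliminary check might be realized is sketched below. The bitmap layout (one bit per code point of the Basic Multilingual Plane), the class and function names, and the sample mapping are assumptions chosen for illustration; they are not taken from FIGs. 3A and 3B.

```python
from typing import Dict, Iterable, Optional


class AvailabilityTable:
    """One bit per BMP code point; a set bit means conversion may be possible."""

    def __init__(self, convertible: Iterable[str]) -> None:
        self._bits = bytearray(0x10000 // 8)
        for ch in convertible:
            cp = ord(ch)
            if cp > 0xFFFF:
                raise ValueError("this sketch only covers the BMP")
            self._bits[cp >> 3] |= 1 << (cp & 7)

    def may_convert(self, ch: str) -> bool:
        cp = ord(ch)
        if cp > 0xFFFF:
            return False
        return bool(self._bits[cp >> 3] & (1 << (cp & 7)))


def guarded_lookup(
    ch: str, mapping: Dict[str, bytes], table: AvailabilityTable
) -> Optional[bytes]:
    """Consult the cheap availability bit before the full mapping look-up."""
    if not table.may_convert(ch):
        return None  # look-up bypassed
    return mapping.get(ch)


# Hypothetical mapping table for a made-up target encoding.
mapping = {"a": b"\x61", "\u00e9": b"\x8e"}
table = AvailabilityTable(mapping)
print(guarded_lookup("\u00e9", mapping, table), guarded_lookup("z", mapping, table))
```

In this sketch the dictionary look-up is already cheap, so the benefit is only illustrative; the check pays off when the real look-up involves multi-character text elements or fallback handling.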

[0090] The code conversion system can be a computer
system or other electronic device for performing
these code conversion operations. This computer system
may be specially constructed for the required purposes,
or it may be a general purpose computer
operating in accordance with a computer program. The
processing presented herein is applicable to any computer
system or other electronic device. In particular,
various general purpose computing machines may be
used with software written in accordance with the teachings
herein, or it may be more convenient to construct a
more specialized electronic device to perform the
required operations.

[0091] FIG. 10 is a block diagram of a representative
computer system 1000 in accordance with the present
invention. The computer system 1000 includes a central
processing unit (CPU) 1002, which CPU is coupled bidirectionally
with random access memory (RAM) 1004
and unidirectionally with read only memory (ROM)
1006. Typically RAM 1004 includes programming
instructions and data, including tables as described
herein, in addition to other data and instructions for
processes currently operating on CPU 1002. The ROM
1006 typically includes basic operating instructions,
data and objects used by the computer system 1000 to
perform its functions. In addition, a mass storage device
1008, such as a hard disk, CD ROM, magneto-optical
(floptical) drive, tape drive or the like, is coupled bidirectionally
with CPU 1002. Mass storage device 1008 generally
includes additional programming instructions,
data and text objects that typically are not in active use
by the CPU, although the address space may be
accessed by the CPU, e.g., for virtual memory or the
like. Each of the above described computers further
includes an input/output source 1010 that typically
includes input media such as a keyboard, pointer
devices (e.g., a mouse or stylus) and the like. The computer
system 1000 can also include a network connection
1012 over which data and instructions can be
transferred. Additional mass storage devices (not
shown) may also be connected to CPU 1002 through
network connection 1012. The computer system 1000
further includes a display screen 1014 for viewing text
and images generated or displayed by the computer
system 1000.

[0092] The CPU 1002 together with an operating system
(not shown) operate to execute computer code. The
computer code may reside on the RAM 1004, the ROM
1006, or a mass storage device 1008. The computer
code could also reside on a portable program medium
1016 and then be loaded or installed onto the computer
system 1000 when needed. Portable program mediums
1016 include, for example, CD-ROMs, PC Card
devices, RAM devices, floppy disks, and magnetic tape.

[0093] The invention can also be embodied as computer
readable code on a computer readable medium.
The computer readable medium is any data storage
device that can store data which can thereafter be
read by a computer system. Examples of the computer
readable medium include read-only memory, random-access
memory, CD-ROMs, magnetic tape, optical data
storage devices, and data transmission media. The
computer readable medium can also be distributed over
network-coupled computer systems so that the computer
readable code is stored and executed in a distributed
fashion.

[0094] The invention has various advantages depend- 
ing on the aspects of the invention being implemented. 
One advantage of the invention is that it converts Uni- 
code text to multiple target encodings so that a high 
quality code conversion is achieved. Another advantage 
of the invention is that it is not dependent on having any 
font or style information that would assist in the code 
conversion. Still another advantage of the invention is 
that it is efficient and controllable such that various cri- 
teria and tolerances can affect the code conversion. 
Another advantage of the invention is that the invention 
can also choose appropriate target encodings for fall- 
back mappings. Yet another advantage of the invention 
is the ability to identify available target encodings on a 
computer system in an automated fashion. 

[0095] The many features and advantages of the
present invention are apparent from the written descrip- 
tion, and thus, it is intended by the appended claims to 
cover all such features and advantages of the invention. 
Further, since numerous modifications and changes will 
readily occur to those skilled in the art, it is not desired 
to limit the invention to the exact construction and oper- 
ation as illustrated and described. Hence, all suitable 
modifications and equivalents may be resorted to as 
falling within the scope of the invention. 

Claims 

1. A code conversion system for converting a source 
string to a target string, said system comprising: 

a target encoding list (106; 220) containing 
available target encodings for said code con- 
version system; and 

a multi-encoding code converter (104; 200) 
that receives the source string and converts the 
source string into the target string, the target 
string including a plurality of encoding runs of 
different ones of the available target encodings. 

2. A code conversion system as recited in claim 1, 
wherein said code conversion system is performed 
by a computer system, and wherein said target 
encoding list is automatically produced for the computer
system.

3. A code conversion system as recited in claims 1 or 
2, wherein said multi-encoding code converter 
(200) comprises: 

a converter for controlling the conversion of the 
source string having a first character encoding 
into the target string having a second character 
encoding; 

a scanner (208), operatively connected to said 
converter, for dividing the source string into text 
elements, each text element including one or 
more characters of the source string; 
a mapping table (214) for storing target encod- 
ings for text elements of the source encoding; 
and 

a lookup handler (212), operatively connected 
to said converter and said mapping table, for 
looking up in said mapping table a conversion 
code associated with a second character 
encoding for each of the text elements. 

4. A code conversion system as recited in claim 3, 
wherein said multi-encoding code converter further 
comprises: 

a fallback handler (216), operatively connected 
to said converter, for providing fallback conver- 
sion codes in certain cases, when said lookup 
handler is unable to provide a conversion code 
for one or more text elements, the fallback con- 
version codes contain one or more code points 
in the target encoding that are not exactly 
equivalent to the characters in the text element 
but have a graphical appearance that is similar. 

5. A code conversion system as recited in claims 3 or 
4, wherein said multi-encoding code converter fur- 
ther comprises: 

a scanner table (210), operatively connected to 
said scanner, for assisting said scanner in 
determining whether individual characters in 
the input string should be included within a cur- 
rent text element or alternatively begin a new 
next text element. 

6. A code conversion system as recited in any of 
claims 1-5, wherein the encoding runs of the avail- 
able target encodings for the target string are varia- 
ble based on predetermined criteria. 

7. A code conversion system as recited in claim 6, 
wherein the predetermined criteria comprises a 
user preference. 

8. A code conversion system as recited in any of
claims 1-7, wherein the source string has a Uni- 
code encoding. 

9. A code conversion system as recited in any of
claims 1-8, wherein the characters in the source 
string are Unicode characters. 

10. A computer-implemented method for converting a
source encoding to target encodings selected from
available target encodings, said method comprising:

(a) receiving a source text block, the source
text block including a series of text elements;

(b) selecting one of the available target encodings;

(c) selecting one of the text elements from the
source text block;

(d) determining whether the selected text element
can be converted into the selected target
encoding;

(e) selecting a next one of the text elements
from the source text block and repeating said
determining (d) when said determining (d)
determines that the selected text element can
be converted into the selected target encoding;
and

(f) selecting another one of the available target
encodings and repeating said determining (d)
for the selected text element when said determining
(d) determines that the selected text
element cannot be converted into the selected
target encoding.

11. A computer-implemented method as recited in 
claim 10, wherein said method further comprises: 

(g) updating a text run array so as to identify the
target encoding for the selected text element
when said determining (d) determines that the
selected text element can be converted into the
selected target encoding.

12. A computer-implemented method as recited in
claim 11, wherein the text run array provides a suitable
target encoding for different sequential runs of
one or more of the text elements of the source text
block.

13. A computer-implemented method as recited in any 
of claims 10-12, wherein the text elements include
one or more characters. 

14. A computer-implemented method as recited in any
of claims 10-13, wherein the text elements are sin- 
gle characters. 

15. A computer-implemented method as recited in any
of claims 10-13, wherein the available target encodings
are provided by a target encoding list.

16. A computer-implemented method as recited in
claim 15, wherein said computer-implemented
method is performed on a computer system, and

wherein said target encoding list is produced
by the following actions:

retrieving scripts that are installed on the computer
system;

retrieving fonts that are installed on the computer
system;

determining a portion of the retrieved fonts to
be added to the target encoding list; and

producing the target encoding list from the
retrieved scripts and the portion of the retrieved
fonts.

17. A computer-implemented method as recited in any of
claims 10-16, wherein the source encoding is Unicode.

18. A computer readable medium including computer
program code for converting a source encoding to
target encodings selected from available target
encodings, said computer readable medium comprising:

first computer program code configured to
receive a source text block, the source text
block including a series of characters;

second computer program code configured to
select one of the available target encodings;

third computer program code configured to
select one of the characters from the source
text block;

fourth computer program code configured to
determine whether the selected character can
be converted into the selected target encoding;
and

fifth computer program code configured to
select another one of the available target
encodings and then to repeat said fourth computer
readable medium using the newly
selected target encoding when said fourth
computer readable medium determines that
the selected text element cannot be converted
into the selected target encoding.



19. A computer readable medium as recited in claim
18, wherein said computer readable medium further
comprises:

sixth computer program code configured
to select a next one of the characters from
the source text block and then to repeat said
fourth computer readable medium using the
newly selected character when said fourth
computer readable medium determines that
the selected character can be converted into
the selected target encoding.

20. A computer readable medium as recited in claim
19, wherein said computer readable medium further
comprises:

seventh computer readable medium configured
to update a text run array so as to identify the
target encoding used for the selected text element
when said sixth computer readable
medium determines that the
selected character can be converted into the
selected target encoding.

21. A computer readable medium as recited in any of 
claims 18-20, wherein the source text block is a 
Unicode text block. 

22. A computer readable medium as recited in any of
claims 18-21, wherein said computer readable medium
further comprises: 

computer program code for determining the 
available target encodings. 

23. A computer readable medium as recited in claim 
22, wherein said computer program code for deter- 
mining the available target encodings operates to 
retrieve scripts that are installed on the computer 
system, retrieve fonts that are installed on the com- 
puter system, determine a portion of the retrieved 
fonts to be added to the target encoding list, and 
identify the available target encodings from the 
retrieved scripts and the portion of the retrieved 
fonts. 

24. A computer readable medium as recited in any of 
claims 22-23, wherein the available target encod- 
ings are provided in an order of preference. 